# High-Resolution Image Understanding

## EuroVLM-9B-Preview
**Org:** utter-project · **License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, Multilingual · **Downloads:** 156 · **Likes:** 2

EuroVLM-9B-Preview is a multimodal vision-language model based on the long-context version of EuroLLM-9B, supporting multiple languages and visual tasks. It is currently a preview release.
## Janus-Pro-7B
**Org:** deepseek-ai · **License:** MIT · **Tags:** Text-to-Image, Transformers · **Downloads:** 139.64k · **Likes:** 3,355

Janus-Pro is an autoregressive framework that unifies multimodal understanding and generation. By decoupling the visual encoding paths while employing a single Transformer architecture, it resolves the conflicting roles a visual encoder plays in understanding versus generation.
## paligemma2-28b-pt-896
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 116 · **Likes:** 48

PaliGemma 2 is a vision-language model (VLM) launched by Google, combining the capabilities of the Gemma 2 language model and the SigLIP vision model, supporting image and text inputs to generate text outputs.
## paligemma2-28b-mix-448
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 198 · **Likes:** 26

PaliGemma 2 is a vision-language model based on Gemma 2, supporting image-plus-text input and text output, suitable for a variety of vision-language tasks.
## paligemma2-10b-pt-896
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 233 · **Likes:** 32

PaliGemma 2 is a vision-language model (VLM) launched by Google that integrates the capabilities of Gemma 2, supporting image and text input to generate text output.
## paligemma2-10b-pt-448
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 282 · **Likes:** 14

PaliGemma 2 is Google's upgraded vision-language model (VLM) that combines the capabilities of Gemma 2, supporting image and text input to generate text output.
## paligemma2-3b-pt-448
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 3,412 · **Likes:** 45

PaliGemma 2 is a vision-language model based on Gemma 2, supporting image and text input to generate text output, suitable for a variety of vision-language tasks.
## paligemma2-3b-ft-docci-448
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 8,765 · **Likes:** 12

PaliGemma 2 is an upgraded vision-language model released by Google, combining the capabilities of Gemma 2 and the SigLIP vision model, supporting multilingual vision-language tasks.
## Llama-3.1-8B-Dragonfly-v2
**Org:** togethercomputer · **Tags:** Image-to-Text, English · **Downloads:** 113 · **Likes:** 1

Dragonfly is an instruction-tuned multimodal vision-language model based on Llama 3.1, supporting joint understanding and generation of images and text.
## ConvLLaVA-JP-1.3b-1280
**Org:** toshi456 · **Tags:** Image-to-Text, Transformers, Japanese · **Downloads:** 31 · **Likes:** 1

ConvLLaVA-JP is a Japanese vision-language model that supports high-resolution input and can hold conversations about input images.
## cogvlm2-llama3-chat-19B-int4
**Org:** THUDM · **License:** Other · **Tags:** Image-to-Text, Transformers, English · **Downloads:** 467 · **Likes:** 28

CogVLM2 is a multimodal dialogue model based on Meta-Llama-3-8B-Instruct, supporting both Chinese and English, with an 8K context length and 1344×1344-resolution image processing.
## 360VL-70B
**Org:** qihoo360 · **License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, Multilingual · **Downloads:** 103 · **Likes:** 10

360VL is an open-source large multimodal model built on the Llama 3 language model, featuring strong image understanding and bilingual text support.
## cogvlm2-llama3-chinese-chat-19B
**Org:** THUDM · **License:** Other · **Tags:** Image-to-Text, Transformers, English · **Downloads:** 118 · **Likes:** 68

CogVLM2 is a large multimodal model built on Meta-Llama-3-8B-Instruct, supporting both Chinese and English, with strong image understanding and dialogue capabilities.
## cogvlm2-llama3-chat-19B
**Org:** THUDM · **License:** Other · **Tags:** Image-to-Text, Transformers, English · **Downloads:** 7,805 · **Likes:** 212

CogVLM2 is a large multimodal model built on Meta-Llama-3-8B-Instruct, supporting image understanding and dialogue tasks with an 8K context length and 1344×1344-resolution image processing.
## 360VL-8B
**Org:** qihoo360 · **License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, Multilingual · **Downloads:** 22 · **Likes:** 13

360VL is a multimodal model built on the Llama 3 language model, featuring strong image understanding and bilingual dialogue capabilities.
## paligemma-3b-pt-896
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 1,788 · **Likes:** 119

PaliGemma is a versatile, lightweight vision-language model (VLM) that supports image and text inputs and generates text outputs, with multilingual capabilities.
## paligemma-3b-ft-ocrvqa-448
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 365 · **Likes:** 6

PaliGemma is a versatile, lightweight vision-language model (VLM) developed by Google, built on the SigLIP vision model and the Gemma language model, supporting image and text inputs with text outputs.
## xgen-mm-phi3-mini-instruct-r-v1
**Org:** Salesforce · **Tags:** Image-to-Text, Transformers, English · **Downloads:** 804 · **Likes:** 186

xGen-MM is the latest series of foundational large multimodal models developed by Salesforce AI Research. Building on improvements to the BLIP series, it features strong image understanding and text generation capabilities.
## llava-llama-3-8b-v1_1-gguf
**Org:** xtuner · **Tags:** Image-to-Text · **Downloads:** 9,484 · **Likes:** 216

A multimodal model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image understanding and text generation.
## llava-llama-3-8b-v1_1-transformers
**Org:** xtuner · **Tags:** Image-to-Text · **Downloads:** 454.61k · **Likes:** 78

A LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image-text-to-text tasks.
## Yi-VL-34B
**Org:** 01-ai · **License:** Apache-2.0 · **Tags:** Image-to-Text · **Downloads:** 150 · **Likes:** 263

Yi-VL-34B is an open-source multimodal model from the Yi series, capable of understanding image content and engaging in multi-turn conversations, with outstanding performance on the MMMU and CMMMU benchmarks.